Abstract. Given a pattern x of length m and a text y of length n, both
over an ordered alphabet, the order-preserving pattern matching problem
consists in finding all substrings of the text with the same relative order
as the pattern. It is an approximate variant of the well-known exact
pattern matching problem and has gained attention in recent years.
The problem finds applications in many fields, such as time series
analysis (for example share prices on stock markets), weather data analysis
and musical melody matching. In this paper we present two new filtering
approaches which turn out to be much more effective in practice than
the previously presented methods. Our experimental results show
that the proposed solutions are up to 2 times faster than the previous
solutions and reduce the number of false positives by up to 99%.


1 Introduction 

Given a pattern x of length m and a text y of length n, both over a common
alphabet Σ, the exact string matching problem consists in finding all
occurrences of the string x in y. String matching is a very important subject in the
wider domain of text processing, and algorithms for the problem are also basic
components used in the implementation of practical software existing under
most operating systems. Moreover, they emphasize programming methods that
serve as paradigms in other fields of computer science. Finally, they also play an
important role in theoretical computer science by providing challenging
problems. The worst case lower bound of the string matching problem is O(n) and
was achieved for the first time by the well-known algorithm by Knuth, Morris and
Pratt [5]. However, many string matching algorithms have also been developed to
obtain sublinear O(n log m / m) performance on average. Among them the
Boyer-Moore algorithm deserves a special mention, since it has been particularly
successful and has inspired much work.

The order-preserving pattern matching problem [2, 3, 8, 9] (OPPM in short)
is an approximate variant of the exact pattern matching problem which has
gained attention in recent years. In this variant the characters of x and y are
drawn from an ordered alphabet Σ with a total order relation defined on it. The
task of the problem is to find all substrings of the text with the same relative
order as the pattern.

Fig. 1. Example of a pattern x of length 5 over an integer alphabet with two
order-preserving occurrences in a text y of length 17, at positions 3 and 10.

For instance the relative order of the sequence x = (6, 5, 8, 4, 7) is the sequence
(3, 1, 0, 4, 2), since the smallest element, 4, is at position 3, the second smallest,
5, is at position 1, and so on. Thus x occurs in the string
y = (8, 11, 10, 16, 15, 20, 13, 17, 14, 18, 20, 18, 25, 17, 20, 25, 26) at position 3, since
x and the subsequence (16, 15, 20, 13, 17) share the same relative order. Another
occurrence of x in y is at position 10 (see Fig. 1).

The OPPM problem finds applications in the fields where we are interested 
in finding patterns affected by relative orders, not by their absolute values. For 
example, it can be applied to time series analysis like share prices on stock 
markets, weather data or to musical melody matching of two musical scores. 

In the last few years some solutions have been proposed for the
order-preserving pattern matching problem. The first solution was presented by
Kubica et al. [7] in 2013. They proposed an O(n + m log m) solution over generic
ordered alphabets based on the Knuth-Morris-Pratt algorithm [8] and an O(n + m)
solution in the case of integer alphabets. Some months later Kim et al. [6]
presented a similar solution running in O(n + m log m) time based on the KMP
approach. Although Kim et al. expressed some doubts about the applicability of
the Boyer-Moore approach [T] to the order-preserving matching problem, in 2013
Cho et al. presented a method for deciding the order-isomorphism between
two sequences, showing that the Boyer-Moore approach can also be applied to
the order-preserving variant of the pattern matching problem. More recently
Chhabra and Tarhio [5] presented a more practical solution based on
approximate string matching. Their technique is based on a conversion of the input
sequences into binary sequences and on the application of any standard algorithm
for exact string matching as a filtration method.

In this paper we present two new families of filtering approaches which turn
out to be much more effective in practice than the previously presented methods.
While the technique proposed by Chhabra and Tarhio translates the input strings
into binary sequences, our methods work on sequences over larger alphabets in
order to speed up the searching process and reduce the number of false positives.
Our experimental results show that the proposed solutions are up to
2 times faster than the previous solutions and reduce the number of false positives
by up to 99% under suitable conditions.

The paper is organized as follows. In Section 2 we give preliminary notions
and definitions relative to the order-preserving pattern matching problem, while
in Section 3 we briefly describe the previous solutions to the problem. Then we
present our new solutions in Section 4 and evaluate their performance against
the previous algorithms in Section 5. Conclusions are drawn in the final section.

2 Notions and Basic Definitions 

A string x over an ordered alphabet Σ, of size σ, is defined as a sequence of
elements in Σ. We suppose that a total order relation "≤" is defined on it, so
that we can establish whether a ≤ b for each a, b ∈ Σ.

We indicate the length of a string x with the symbol |x|. We refer to the
elements in x with the symbol x[i], for 0 ≤ i < |x|. Moreover we indicate with
x[i ... j] the subsequence of x from the element of position i to the element of
position j (including the extremes), for 0 ≤ i ≤ j < |x|.

We say that two sequences x, y ∈ Σ* are order isomorphic if the relative order
of their elements is the same. More formally we give the following definition.

Definition 1 (order isomorphism). Given an ordered alphabet Σ and two
sequences x, y ∈ Σ* of the same length, we say that x and y are order-isomorphic,
and write x ≈ y, if the following conditions hold

1. |x| = |y|

2. x[i] ≤ x[j] if and only if y[i] ≤ y[j], for 0 ≤ i, j < |x|

Definition 2 (rank function). Let x be a sequence of length m over an
ordered alphabet Σ. The rank function of x is a mapping r : {0, 1, ..., m − 1} →
{0, 1, ..., m − 1} such that x[r(i)] ≤ x[r(j)] holds for each pair 0 ≤ i < j < m.
Formally we define

$$ r(i) = |\{\, j : x[j] < x[i] \text{ or } (x[j] = x[i] \text{ and } j < i) \,\}| $$

for 0 ≤ i < m.

We will refer to the value r(i) as the rank of x[i] in x, while we will refer to the
sequence (r(0), r(1), ..., r(m − 1)) as the relative order of x.

According to Definition 2 we have that x[r(0)] is the smallest number while
x[r(m − 1)] is the greatest number in x. If we assume that sort(x) is the time
required to sort all the elements of x, then it is easy to observe that the relative
order of x can be computed in O(sort(x)) time.
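
As an illustration, the relative order can be obtained with a single sort. The sketch below is ours, not the authors' code; the name relative_order is hypothetical, and it follows the convention used in the worked examples of this paper, where entry i of the result is the position of the i-th smallest element and ties are broken by position.

```python
def relative_order(x):
    """Return the positions of x listed in nondecreasing order of value.

    Ties are broken by position, so the result is a permutation of
    range(len(x)) even when x contains equal values.
    """
    return sorted(range(len(x)), key=lambda i: (x[i], i))


print(relative_order([6, 5, 8, 4, 7]))          # [3, 1, 0, 4, 2]
print(relative_order([6, 3, 8, 3, 10, 7, 10]))  # [1, 3, 0, 5, 2, 4, 6]
```

The cost is that of one comparison sort, in line with the O(sort(x)) bound stated above.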

In addition, we define the equality function of x, which indicates which
elements of the sequence are equal (if any). More formally we have the following
definition.


ORDER-ISOMORPHISM(r, eq, y, j)
1. for i ← 0 to |x| − 2 do
2.     if (y[j + r(i)] > y[j + r(i + 1)]) then return false
3.     if (y[j + r(i)] < y[j + r(i + 1)] and eq(i) = 1) then return false
4.     if (y[j + r(i)] = y[j + r(i + 1)] and eq(i) = 0) then return false
5. return true

Fig. 2. The function used to verify if two sequences x and y[j ... j + |x| − 1] are order
isomorphic. We assume that the function receives as input the parameters r and eq,
which represent the rank function and the equality function of x, respectively.


Definition 3 (equality function). Let x be a sequence of length m over an
ordered alphabet Σ and let r be the rank function of x. The equality function of
x is a mapping eq : {0, 1, ..., m − 2} → {0, 1} such that, for each 0 ≤ i < m − 1,

$$ eq(i) = \begin{cases} 1 & \text{if } x[r(i)] = x[r(i+1)] \\ 0 & \text{otherwise} \end{cases} $$


Let r be the rank function of a string x, such that m = |x|, and let eq be its
equality function. It is easy to prove that x and y are order isomorphic if and
only if they share the same rank and equality function, i.e. if and only if the
following two conditions hold

1. y[r(i)] ≤ y[r(i + 1)], for 0 ≤ i < m − 1

2. y[r(i)] = y[r(i + 1)] if and only if eq(i) = 1, for 0 ≤ i < m − 1

Example 1. Let x = (6, 3, 8, 3, 10, 7, 10) and y = (2, 1, 4, 1, 5, 3, 5) be two sequences
of size 7. We have that the relative order of x is (1, 3, 0, 5, 2, 4, 6) while its
equality function is eq = (1, 0, 0, 0, 0, 1). The two strings are order isomorphic
according to the definition given above, i.e. x ≈ y.

The procedure to verify that two numeric sequences, x and y, are order
isomorphic is shown in Fig. 2. It receives as input the functions r and eq, computed
on x, and returns a boolean value indicating whether x ≈ y. The algorithm requires
O(m) time, where m is the length of the sequences. A mismatch occurs when
one of the three conditions of lines 2, 3 and 4 holds.
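
The verification step can be sketched in Python as follows. This is an illustrative transcription under our own naming (relative_order, equality_function and order_isomorphic are hypothetical names), with the start position j of the text window passed explicitly; it is not the authors' C implementation.

```python
def relative_order(x):
    """Positions of x in nondecreasing order of value (ties by position)."""
    return sorted(range(len(x)), key=lambda i: (x[i], i))


def equality_function(x, r):
    """eq[i] = 1 when the i-th and (i+1)-th smallest elements of x are equal."""
    return [1 if x[r[i]] == x[r[i + 1]] else 0 for i in range(len(x) - 1)]


def order_isomorphic(r, eq, y, j):
    """Check whether y[j .. j+m-1] has the order described by (r, eq)."""
    for i in range(len(r) - 1):
        a, b = y[j + r[i]], y[j + r[i + 1]]
        if a > b:                  # order violated
            return False
        if a < b and eq[i] == 1:   # pattern has a tie here, window does not
            return False
        if a == b and eq[i] == 0:  # window has a tie here, pattern does not
            return False
    return True


# The two sequences of Example 1.
x = [6, 3, 8, 3, 10, 7, 10]
y = [2, 1, 4, 1, 5, 3, 5]
r = relative_order(x)
eq = equality_function(x, r)
print(eq)                             # [1, 0, 0, 0, 0, 1]
print(order_isomorphic(r, eq, y, 0))  # True
```

Each call performs at most m − 1 iterations with a constant number of comparisons, which matches the O(m) bound given above.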

The OPPM problem consists in finding all substrings of the text with the same
relative order as the pattern. Specifically we have the following formal definition.

Definition 4 (order preserving pattern matching). Let x and y be two
sequences of length m and n, respectively (and n > m), both over an ordered
alphabet Σ. The order preserving pattern matching problem consists in finding
all indexes i, with 0 ≤ i ≤ n − m, such that y[i ... i + m − 1] ≈ x.

If an occurrence of the pattern x starts at position i of the text y, we say that
x has an order-preserving occurrence at position i.


3 Previous Results 


The OPPM problem has drawn particular attention in the last few years, during
which some efficient results have been proposed.

The first algorithm to solve the OPPM problem was presented by Kubica et
al. in [7]. Their solution was an adaptation of the well-known Knuth-Morris-Pratt
algorithm for the exact string matching problem, where the fail function is
adapted to compute the order-borders table. The authors proved that this table
can be computed in linear time in the length of the pattern x, if the relative
order of x is known in advance. The overall time complexity of the algorithm is
O(n + m log m), where m is the length of the pattern while n is the length of
the text. However in [3] Cho et al. proved that the algorithm presented in [7]
can decide incorrectly when there are equal values in the string.

The second algorithm based on Knuth-Morris-Pratt was presented later by
Kim et al. [6]. Their algorithm is based on the prefix representation and it is
further optimized according to the nearest neighbor representation. The prefix
representation is based on finding the rank of each integer in the prefix. It can be
computed easily by inserting each character into a dynamic order statistic tree
and then computing the rank of each character in the prefix. The time complexity
of computing such a prefix representation is O(m log m). The failure function is
then computed as in the Knuth-Morris-Pratt algorithm in O(m log m) time. The
overall time complexity of this algorithm is O(n + m log m). Again, this solution
does not work properly when there are equal values in the pattern.

The first sublinear solution for the OPPM problem was presented by Cho et
al. in [3]. Their algorithm is an adaptation to OPPM of the well-known
Boyer-Moore approach. They apply a q-grams technique, i.e. groups of q consecutive
characters are treated as a single condensed character, in order to make the shifts
longer. In this way, a large amount of text can be skipped for long patterns.

More recently Chhabra and Tarhio presented a new practical solution [2]
based on a filtration technique. Their algorithm translates the input sequences
into two binary sequences and then uses any standard exact pattern matching
algorithm as a filtration procedure. In particular, in their approach a sequence s
is translated into a binary sequence β of length |s| − 1 according to the following
rule

$$ \beta[i] = \begin{cases} 1 & \text{if } s[i] \geq s[i+1] \\ 0 & \text{otherwise} \end{cases} \qquad (1) $$

for each 0 ≤ i < |s| − 1. This translation is unique for a given sequence s and can
be performed online on the text, requiring constant time for each text character.
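
The following sketch illustrates the binary translation and why a verification step is still needed afterwards. The function name binary_translation is ours, and the comparison is assumed to be "greater than or equal", which is the reading that agrees with the worked examples later in this paper.

```python
def binary_translation(s):
    """Binary sequence of length len(s) - 1: 1 where s[i] >= s[i+1], else 0."""
    return [1 if s[i] >= s[i + 1] else 0 for i in range(len(s) - 1)]


pattern = [6, 5, 8, 4, 7]
match = [16, 15, 20, 13, 17]      # same relative order as the pattern
false_positive = [5, 1, 9, 4, 6]  # same up/down shape, different relative order

print(binary_translation(pattern))         # [1, 0, 1, 0]
print(binary_translation(match))           # [1, 0, 1, 0]
print(binary_translation(false_positive))  # [1, 0, 1, 0]
```

All three sequences share the same binary code, but only the second one is order-isomorphic to the pattern: in the third one the smallest element is at position 1 rather than position 3. The filter therefore only proposes candidates.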

Thus when a candidate occurrence is found during the filtration phase an
additional verification procedure is run in order to check for the order-isomorphism
of the candidate substring and the pattern. Despite its quadratic worst-case time
complexity, this approach turns out to be simpler and more effective in practice than
earlier solutions. It is important to notice that any algorithm for exact string
matching can be used as a filtration method. The authors also proved that if the
underlying filtration algorithm is sublinear and the text is translated online, the
overall complexity of the algorithm is sublinear on average. Experimental results
conducted in [5] show that the filter approach was considerably faster than the
algorithm by Cho et al.

For the sake of completeness we notice that Crochemore et al. presented in [3]
a solution for the offline version of the OPPM problem based on a new data
structure called order-preserving suffix tree. Their solution finds all occurrences
of x in y in O((m log n)/log log m + z) time, where z is the number of occurrences of x
in y. In this paper we concentrate on the online version of the OPPM problem.


4 New Efficient Filter Based Algorithms 

In this section we present two new general approaches for the OPPM problem.
Both of them are based on a filtration technique, as in [2], but we use information
extracted from groups of integers in the input string, as in [3], in order to make
the filtration phase more effective in terms of efficiency and accuracy, as discussed
below.

Text filtration is a widely used technique in the field of exact and approximate
string matching. Specifically, instead of checking at each position of the text whether
the pattern occurs, it is generally more efficient to filter text positions and check only
when a substring looks like the pattern. When a resemblance has been detected
a naive check of the occurrence is performed. In the literature filtration techniques
are generally improved by using q-grams, i.e. groups of adjacent characters of
the string which are considered as a single character of a condensed alphabet.

It is always convenient to use a filtration method which localizes candidate
occurrences better and faster, properties which correspond to the accuracy and the
efficiency of the method, respectively.

The accuracy of a filtration method is a value indicating how many false
positives are detected during the filtration phase, i.e. the number of candidate
occurrences detected by the filtration algorithm which are not real occurrences
of the pattern. The efficiency is instead related to the time complexity of the
procedure we use for managing q-grams and to the time efficiency of the overall
searching algorithm. It is clear that these two values are strongly related, since
a low accuracy implies a high number of false positives and, as a consequence,
a decrease in the performance of the searching algorithm.

When using q-grams, greater accuracy requires greater values
of q. However, in this context, the value of q represents a trade-off between the
computational time required for computing the q-grams for each window of the
text and the computational time needed for checking false positive candidate
occurrences. The larger the value of q, the more time is needed to compute
each q-gram. On the other hand, the larger the value of q, the smaller the
number of false positives the algorithm finds along the text during the filtration.

In our approaches we make use of the following definition of q-neighborhood
of an element in an integer string.


Definition 5 (q-neighborhood). Given a string x of length m, we define the
q-neighborhood of the element x[i], with 0 ≤ i < m − q, as the sequence of q + 1
elements from position i to i + q in x, i.e. the sequence (x[i], x[i + 1], ..., x[i + q]).

Both the filtration methods presented below translate the input sequence
into a target numeric sequence which is used for the filtration. Specifically each
position i of the sequence is associated with a numeric value computed from the
structure of the q-neighborhood of the element x[i].


4.1 The Neighborhood Ranking Approach 


Given a string x of length m, we can compute the relative position of the element
x[i] compared with the element x[j] by querying the inequality x[i] ≥ x[j]. For
brevity we will write β_x(i, j) to indicate the boolean value resulting
from the above inequality, extending the formal definition given in Equation (1).
Formally we have

$$ \beta_x(i,j) = \begin{cases} 1 & \text{if } x[i] \geq x[j] \\ 0 & \text{otherwise} \end{cases} \qquad (2) $$

It is easy to observe that if β_x(i, j) = 1 we have that r(i) > r(j) (x[j] precedes
x[i] in the ordering of the elements of x), otherwise r(i) < r(j).

The neighborhood ranking (NR) approach associates each position i of the
string x (where 0 ≤ i < m − q) with the sequence of the relative positions
between x[i] and x[i + j], for j = 1, ..., q. In other words we compute the binary
sequence (β_x(i, i + 1), β_x(i, i + 2), ..., β_x(i, i + q)) of length q indicating the relative
positions of the element x[i] compared with the other values in its q-neighborhood.
Of course, we do not include in the sequence the value β_x(i, i), since
it does not give any additional information.

Since there are 2^q possible configurations of a binary sequence of length q,
the string x is converted into a sequence χ_x^q of length m − q, where each element
χ_x^q[i], for 0 ≤ i < m − q, is a value such that 0 ≤ χ_x^q[i] < 2^q.
More formally we have the following definition


Definition 6 (q-NR sequence). Given a string x of length m and an integer
q < m, the q-NR sequence associated with x is a numeric sequence χ_x^q of length
m − q over the alphabet {0, ..., 2^q − 1} where

$$ \chi_x^q[i] = \sum_{j=1}^{q} \left( \beta_x(i, i+j) \times 2^{q-j} \right), \quad \text{for all } 0 \leq i < m - q $$


Example 2. Let x = (5, 6, 3, 8, 10, 7, 1, 9, 10, 8) be a sequence of length 10. The
4-neighborhood of the element x[2] is the subsequence (3, 8, 10, 7, 1). Observe that
x[2] is greater than x[6] and less than all other values in its 4-neighborhood. Thus
the ranking sequence associated with the element of position 2 is (0, 0, 0, 1), which
translates into an NR value equal to 1. In a similar way we can observe that the NR
sequence associated with the element of position 3 is (0, 1, 1, 0), which translates
into an NR value equal to 6. The whole 4-NR sequence of length 6 associated to x is
χ_x^4 = (4, 8, 1, 6, 15, 8).
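
The values of Example 2 can be reproduced with a few lines of Python. This is a sketch under our own naming (beta, nr_value and nr_sequence are hypothetical names); it assumes the comparison x[i] ≥ x[j] for β, which is the reading under which the value 15 at position 4 is obtained, and it accumulates bits from the most significant one.

```python
def beta(x, i, j):
    """Relative position of x[i] with respect to x[j]: 1 if x[i] >= x[j], else 0."""
    return 1 if x[i] >= x[j] else 0


def nr_value(x, i, q):
    """q-neighborhood ranking value of x[i]; requires i + q < len(x)."""
    delta = 0
    for j in range(1, q + 1):
        delta = (delta << 1) + beta(x, i, i + j)
    return delta


def nr_sequence(x, q):
    """q-NR sequence of x, of length len(x) - q."""
    return [nr_value(x, i, q) for i in range(len(x) - q)]


x = [5, 6, 3, 8, 10, 7, 1, 9, 10, 8]
print(nr_value(x, 2, 4))  # 1
print(nr_value(x, 3, 4))  # 6
print(nr_sequence(x, 4))  # [4, 8, 1, 6, 15, 8]
```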


Neighborhood Ranking Example                NR seq.    χ_x^3[i]
x[i] < x[i+1], x[i+2], x[i+3]               (0,0,0)    0
x[i+3] ≤ x[i] < x[i+1], x[i+2]              (0,0,1)    1
x[i+2] ≤ x[i] < x[i+1], x[i+3]              (0,1,0)    2
x[i+2], x[i+3] ≤ x[i] < x[i+1]              (0,1,1)    3
x[i+1] ≤ x[i] < x[i+2], x[i+3]              (1,0,0)    4
x[i+1], x[i+3] ≤ x[i] < x[i+2]              (1,0,1)    5
x[i+1], x[i+2] ≤ x[i] < x[i+3]              (1,1,0)    6
x[i+1], x[i+2], x[i+3] ≤ x[i]               (1,1,1)    7

Fig. 3. The 2^3 possible 3-neighborhood ranking sequences associated with element x[i],
and their corresponding NR value. In the leftmost column we show the ranking position
of x[i] compared with the other elements in its neighborhood (x[i], x[i+1], x[i+2], x[i+3]).


The following Lemma and Corollary prove that the NR approach can be
used to filter a text y in order to search for all order preserving occurrences of a
pattern x. In other words they prove that

$$ \{\, i \mid x \approx y[i \ldots i+m-1] \,\} \subseteq \{\, i \mid \chi_x^q = \chi_y^q[i \ldots i+m-q-1] \,\}. $$

Lemma 1. Let x and y be two sequences of length m and let χ_x^q and χ_y^q be the
q-ranking sequences associated to x and y, respectively. If x ≈ y then χ_x^q = χ_y^q.

Proof. Let r be the rank function associated to x and suppose by hypothesis
that x ≈ y. Then the following statements hold

1. by Definition 2 we have x[r(i)] ≤ x[r(i + 1)], for 0 ≤ i < m − 1;

2. by hypothesis and Definition 1, y[r(i)] ≤ y[r(i + 1)], for 0 ≤ i < m − 1;

3. then by 1 and 2, x[i] ≤ x[j] iff y[i] ≤ y[j], for 0 ≤ i, j ≤ m − 1;

4. the previous statement implies that x[i] ≥ x[i + j] iff y[i] ≥ y[i + j]
for 0 ≤ i < m − q and 1 ≤ j ≤ q;

5. by statement 4 we have that β_x(i, i + j) = β_y(i, i + j)
for 0 ≤ i < m − q and 1 ≤ j ≤ q;

6. finally, by 5 and Definition 6 we have χ_x^q[i] = χ_y^q[i], for 0 ≤ i < m − q.

This last statement proves the thesis. ■

The following corollary proves that the NR approach can be used as a filter.
It trivially follows from Lemma 1.







COMPUTE-NR-VALUE(x, i, q)
1. δ ← 0
2. for j ← 1 to q do
3.     δ ← (δ ≪ 1) + β_x(i, i + j)
4. return δ

Fig. 4. The function which computes the q-neighborhood ranking value of the element
of position i in a sequence x. The value is computed in O(q) time.


Corollary 1. Let x and y be two sequences of length m and n, respectively.
Let χ_x^q and χ_y^q be the q-ranking sequences associated to x and y, respectively. If
x ≈ y[j ... j + m − 1] then χ_x^q[i] = χ_y^q[j + i], for 0 ≤ i < m − q. ■

Fig. 4 shows the procedure used for computing the NR value associated with
the element of the string x at position i. The time complexity of the procedure
is O(q). Thus, given a pattern x of length m, a text y of length n and an
integer value q < m, we can solve the OPPM problem by searching χ_y^q for all
occurrences of χ_x^q, using any algorithm for the exact string matching problem.
During the preprocessing phase we compute the sequence χ_x^q and the functions
r_x and eq_x. When an occurrence of χ_x^q is found at position i the verification
procedure ORDER-ISOMORPHISM(r, eq, y, i) (shown in Fig. 2) is run in order to
check whether x ≈ y[i ... i + m − 1].

Since in the worst case the algorithm finds a candidate occurrence at each
text position and each verification costs O(m), the worst case time complexity
of the algorithm is O(nm), while the filtration phase can be performed with an
O(nq) worst case time complexity. However, following the same analysis of [2], it
can easily be proved that the verification time approaches zero when the length of the
pattern grows, so that the filtration time dominates. Thus if the filtration algorithm is
sublinear, the total algorithm is sublinear.
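
The whole filter-and-verify scheme can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: all function names are ours, β is taken as "greater than or equal", the text is a small made-up sequence, and the filtration phase uses a naive window comparison where the paper plugs in a fast exact matcher such as SBNDM2.

```python
def nr_sequence(s, q):
    """q-NR sequence of s: one q-bit value per position, most significant bit first."""
    out = []
    for i in range(len(s) - q):
        value = 0
        for j in range(1, q + 1):
            value = (value << 1) + (1 if s[i] >= s[i + j] else 0)
        out.append(value)
    return out


def order_isomorphic(r, eq, y, j):
    """Verify the candidate window y[j .. j+m-1] against the pattern's (r, eq)."""
    for i in range(len(r) - 1):
        a, b = y[j + r[i]], y[j + r[i + 1]]
        if a > b or (a < b and eq[i] == 1) or (a == b and eq[i] == 0):
            return False
    return True


def oppm_nr_search(x, y, q):
    """Positions i where y[i .. i+m-1] is order-isomorphic to x (requires q < m)."""
    m, n = len(x), len(y)
    # Preprocessing: rank function, equality function and NR sequence of x.
    r = sorted(range(m), key=lambda i: (x[i], i))
    eq = [1 if x[r[i]] == x[r[i + 1]] else 0 for i in range(m - 1)]
    chi_x = nr_sequence(x, q)
    chi_y = nr_sequence(y, q)
    k = len(chi_x)  # m - q
    hits = []
    for i in range(n - m + 1):
        # Filtration (naive stand-in for an exact matcher), then verification.
        if chi_y[i:i + k] == chi_x and order_isomorphic(r, eq, y, i):
            hits.append(i)
    return hits


x = [6, 5, 8, 4, 7]
y = [1, 16, 15, 20, 13, 17, 2, 30, 25, 40, 20, 35]  # hypothetical text
print(oppm_nr_search(x, y, 3))  # [1, 7]
```

By Corollary 1 the filter never discards a real occurrence, so the output does not depend on q; the choice of q only changes how many candidates reach the verification step.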

4.2 The Neighborhood Ordering Approach 

The neighborhood ranking approach described in the previous section gives
partial information about the relative ordering of the elements in the q-neighborhood
of an element in x. The binary sequence of length q used to represent each element
x[i] is not enough to describe the full ordering information of a set of q + 1 elements.

The q-neighborhood ordering (NO) approach, which we describe in this
section, associates each element of x with a binary sequence which completely
describes the ordering disposition of the elements in the q-neighborhood of x[i].
The number of comparisons we need to order a sequence of q + 1 elements is
between q (the best case) and q(q + 1)/2 (the worst case). In this latter case it
is enough to compare the element x[j], where i ≤ j < i + q, with each element
x[h], where j < h ≤ i + q.

Thus each element of position i in x, with 0 ≤ i < m − q, is associated with a
binary sequence of length q(q + 1)/2 which completely describes the relative order

Neighborhood Ordering Example      NO seq.    φ_x^2[i]
(x[i], x[i+1], x[i+2])             (0,0,0)    0
(x[i], x[i+2], x[i+1])             (0,0,1)    1
(x[i+2], x[i], x[i+1])             (0,1,1)    3
(x[i+1], x[i], x[i+2])             (1,0,0)    4
(x[i+1], x[i+2], x[i])             (1,1,0)    6
(x[i+2], x[i+1], x[i])             (1,1,1)    7

Fig. 5. The 3! possible orderings of the sequence (x[i], x[i+1], x[i+2]) and the
corresponding binary sequence (β_x(i, i+1), β_x(i, i+2), β_x(i+1, i+2)).


COMPUTE-NO-VALUE(x, i, q)
1. δ ← 0
2. for k ← q downto 1 do
3.     for j ← 1 to k do
4.         δ ← (δ ≪ 1) + β_x(i + q − k, i + q − k + j)
5. return δ

Fig. 6. The function which computes the q-neighborhood ordering value of the element
of position i in a sequence x. The value is computed in O(q^2) time.


of the subsequence x[i ... i + q]. Since there are (q + 1)! possible permutations
of a set of q + 1 elements, the string x is converted into a sequence φ_x^q of length
m − q, where each element φ_x^q[i] is a value such that 0 ≤ φ_x^q[i] < 2^(q(q+1)/2).

More formally we have the following definition

Definition 7 (q-NO sequence). Given a string x of length m and an integer
q < m, the q-NO sequence associated with x is a numeric sequence φ_x^q of length
m − q over the alphabet {0, ..., 2^(q(q+1)/2) − 1} where

$$ \varphi_x^q[i] = \sum_{k=1}^{q} \chi_x^k[i+q-k] \times 2^{k(k-1)/2}, \quad \text{for all } 0 \leq i < m - q \qquad (3) $$

Thus the q-NO value associated to x[i] is the combination of q different NR
values χ_x^q[i], χ_x^(q−1)[i + 1], ..., χ_x^1[i + q − 1].

For instance the 4-NO value associated to x[i] is computed as

$$ \varphi_x^4[i] = \chi_x^4[i] \times 2^6 + \chi_x^3[i+1] \times 2^3 + \chi_x^2[i+2] \times 2 + \chi_x^1[i+3] $$




Example 3. As in Example 2, let x = (5, 6, 3, 8, 10, 7, 1, 9, 10, 8) be a sequence of
length 10. The 3-neighborhood of the element x[3] is the subsequence (8, 10, 7, 1).
The NO sequence of length 6 associated with the element of position 3 is therefore
(0, 1, 1, 1, 1, 1), which translates into an NO value equal to φ_x^3[3] = 31. In a similar
way we can observe that the NO sequence associated with the element of position
2 is (0, 0, 0, 0, 1, 1), which translates into an NO value equal to φ_x^3[2] = 3. The whole
sequence of length 7 associated to x is φ_x^3 = (20, 32, 3, 31, 60, 32, 3).
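
The values of Example 3 can be checked with the following sketch, which transcribes the nested loop of the NO computation into Python. The names beta, no_value and no_sequence are ours, and β is assumed to test x[i] ≥ x[j].

```python
def beta(x, i, j):
    """1 if x[i] >= x[j], else 0."""
    return 1 if x[i] >= x[j] else 0


def no_value(x, i, q):
    """q-neighborhood ordering value of x[i]; requires i + q < len(x).

    Concatenates, most significant bits first, the comparisons of x[i] with its
    q successors, then of x[i+1] with its q-1 successors, and so on.
    """
    delta = 0
    for k in range(q, 0, -1):
        base = i + q - k
        for j in range(1, k + 1):
            delta = (delta << 1) + beta(x, base, base + j)
    return delta


def no_sequence(x, q):
    """q-NO sequence of x, of length len(x) - q."""
    return [no_value(x, i, q) for i in range(len(x) - q)]


x = [5, 6, 3, 8, 10, 7, 1, 9, 10, 8]
print(no_value(x, 3, 3))  # 31
print(no_value(x, 2, 3))  # 3
print(no_sequence(x, 3))  # [20, 32, 3, 31, 60, 32, 3]
```

Each value uses q(q + 1)/2 bits, so for q = 3 the results lie in the range 0 to 63.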

The following Lemma and Corollary prove that the NO approach can be
used to filter a text y in order to search for all order preserving occurrences of a
pattern x. In other words they prove that

$$ \{\, i \mid x \approx y[i \ldots i+m-1] \,\} \subseteq \{\, i \mid \varphi_x^q = \varphi_y^q[i \ldots i+m-q-1] \,\}. $$

Lemma 2. Let x and y be two sequences of length m and let φ_x^q and φ_y^q be the
q-NO sequences associated to x and y, respectively. If x ≈ y then φ_x^q = φ_y^q.

Proof. The lemma easily follows from Definition 7 and Lemma 1. ■

The following corollary proves that the NO approach can be used as a filter.
It trivially follows from Lemma 2.

Corollary 2. Let x and y be two sequences of length m and n, respectively.
Let φ_x^q and φ_y^q be the q-NO sequences associated to x and y, respectively. If
x ≈ y[j ... j + m − 1] then φ_x^q[i] = φ_y^q[j + i], for 0 ≤ i < m − q. ■

Fig. 6 shows the procedure used for computing the NO value associated with
the element of the string x at position i. The time complexity of the procedure
is O(q^2). Thus, given a pattern x of length m, a text y of length n and an
integer value q < m, we can solve the OPPM problem by searching φ_y^q for all
occurrences of φ_x^q, using any algorithm for the exact string matching problem.
During the preprocessing phase we compute the sequence φ_x^q and the functions
r_x and eq_x. When an occurrence of φ_x^q is found at position i the verification
procedure ORDER-ISOMORPHISM(r, eq, y, i) (shown in Fig. 2) is run in order to
check whether x ≈ y[i ... i + m − 1].

Also in this case, if the filtration algorithm is sublinear on average, the NO
approach has a sublinear behavior on average.

5 Experimental Evaluations 

In this section we present experimental results in order to evaluate the
performance of the new filter based algorithms presented in this paper. In
particular we tested our filter approaches against the filter approach of Chhabra and
Tarhio [2], which is, to the best of our knowledge, the most effective solution in
practical cases. In the experimental evaluation conducted in [2] the SBNDM2 and
SBNDM4 algorithms [5] turned out to be the most effective exact string matching
algorithms which can be used in combination with the filter technique. Following
the same line, in our experimental evaluation we use the SBNDM2 algorithm
in all cases. However, any other exact string matching algorithm could be used for
this purpose. In our dataset we use the following names to identify the tested
algorithms

— FCT: the SBNDM2 algorithm based on the filter approach by Chhabra and
Tarhio presented in [5];

— NRq: the SBNDM2 algorithm based on the Neighborhood Ranking approach
presented in Section 4.1;

— NOq: the SBNDM2 algorithm based on the Neighborhood Ordering approach
presented in Section 4.2.

We do not compare our solution with the Boyer-Moore approach by Cho et
al. [3] since it was shown to be less efficient than the algorithm by Chhabra and 
Tarhio in all cases. We evaluated our filter based solutions in terms of efficiency, 
i.e. the running times, and accuracy, i.e. the percentage of false positives detected 
during the filtration phase. In particular for the Fct algorithm we will report the 
average running times, in milliseconds, and the average number of false positives 
detected every 2^° text characters. Instead, for all other algorithms in the set, 
we will report the following two values 

— the speed up of the running times obtained when compared with the time
used by the FCT algorithm. If time(FCT) is the running time of the FCT
algorithm and t is the running time of our algorithm, then the speed up is
computed as time(FCT)/t.

— the percentage of the gain in the number of false positives detected by the
algorithm when compared with the FCT algorithm. If fp(FCT) is the number
of false positives detected on average by the FCT algorithm and fp is the number
of false positives detected by our filter approach, then the gain is computed
as 100 × (fp(FCT) − fp)/fp(FCT).

We tested our solutions on sequences of short integer values (each element
is an integer in the range [0 ... 256]), long integer values (where each element is
an integer in the range [0 ... 10000]) and floating point values (each element is
a floating point number in the range [0.0 ... 10000.99]). However, we did not observe
appreciable differences in the results, thus in the following tables we report for brevity
the results obtained on short integer sequences. All texts have 1 million
elements. In particular we tested our algorithms on the following set of short integer
sequences.

— Rand-δ: a sequence of random integer values ranging around a fixed mean
equal to 100. Each value of the sequence is randomly chosen around the mean
with a variability of δ, so that the text can be seen as a random sequence of
integers between 100 − δ and 100 + δ with a uniform distribution.

— Period-δ: a sequence of random integer values ranging around a periodic
function with a period of 10 elements. Each value of the sequence is randomly
chosen around the function with a variability of δ. All values of the sequences
are always in the range {0 ... 200 + δ}.
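
As a rough illustration of the first kind of text, a Rand-δ sequence can be generated as follows. This is a sketch based only on the description above: the function name rand_delta, the seed parameter and the use of Python's random module are our assumptions, and the authors' actual generator is not specified. The periodic function behind Period-δ is not given in the text, so it is not sketched here.

```python
import random


def rand_delta(n, delta, mean=100, seed=None):
    """n integers drawn uniformly from [mean - delta, mean + delta]."""
    rng = random.Random(seed)  # a seed makes the text reproducible
    return [rng.randint(mean - delta, mean + delta) for _ in range(n)]


text = rand_delta(1000, 5, seed=42)
print(len(text), min(text) >= 95, max(text) <= 105)  # 1000 True True
```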


m     FCT         NR2     NR3     NR4     NR5     NR6     NO2     NO3     NO4

Running times (FCT: milliseconds; other columns: speed up)
8     44.29       1.16    1.28    1.25    1.25    1.24    1.89    1.71    1.11
12    28.39       1.16    1.37    1.37    1.33    1.19    1.64    2.00    1.64
16    20.65       1.15    1.30    1.43    1.34    1.14    1.42    2.01    1.83
20    16.29       1.15    1.30    1.45    1.41    1.14    1.39    2.00    1.93
24    13.64       1.16    1.29    1.42    1.44    1.12    1.34    1.91    2.01
28    11.48       1.16    1.28    1.44    1.45    1.11    1.31    1.88    1.96
32    10.34       1.18    1.30    1.40    1.46    1.12    1.30    1.83    2.05

False positives (FCT: average number; other columns: gain in %)
8     15713.46    84.1    92.4    95.1    94.0    90.2    97.5    99.1    99.6
12    1420.78     95.8    99.3    99.7    99.8    97.5    99.8    100.0   100.0
16    123.22      99.4    100.0   100.0   100.0   99.7    100.0   100.0   100.0
20    12.07       100.0   100.0   100.0   100.0   100.0   100.0   100.0   100.0
24    1.01        100.0   100.0   100.0   100.0   100.0   100.0   100.0   100.0
28    0.02        100.0   100.0   100.0   100.0   100.0   100.0   100.0   100.0
32    0.00        -       -       -       -       -       -       -       -

Table 1. Experimental results on a Rand-5 short integer sequence.


Running times (Fct in milliseconds; remaining columns: speed-up relative to Fct)

 m       Fct    Nr2    Nr3    Nr4    Nr5    Nr6    No2    No3    No4
 8     42.34   1.13   1.27   1.25   1.26   1.22   1.92   1.68   1.08
12     27.93   1.17   1.40   1.37   1.32   1.21   1.71   2.04   1.63
16     20.05   1.15   1.32   1.41   1.33   1.15   1.48   2.04   1.81
20     15.85   1.15   1.29   1.42   1.37   1.11   1.38   2.00   1.90
24     13.31   1.17   1.31   1.47   1.42   1.12   1.36   1.99   2.02
28     11.38   1.17   1.31   1.42   1.45   1.09   1.35   1.94   2.07
32      9.96   1.16   1.29   1.45   1.46   1.09   1.29   1.87   2.09

False positives (Fct: average number detected; remaining columns: gain in %)

 m       Fct    Nr2    Nr3    Nr4    Nr5    Nr6    No2    No3    No4
 8  14326.78   83.6   92.3   95.6   92.9   90.2   97.7   99.3   99.7
12   1295.88   96.4   99.5   99.9   99.9   97.8   99.9  100.0  100.0
16    118.79   99.3  100.0  100.0  100.0   99.7  100.0  100.0  100.0
20     10.43  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0
24      0.71  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0
28      0.00      -      -      -      -      -      -      -      -
32      0.00      -      -      -      -      -      -      -      -

Table 2. Experimental results on a Rand-20 short integer sequence.


For each text in the set we randomly selected 100 patterns extracted from the
text and computed the average running time over the 100 runs. We also computed
the average number of false positives detected by the algorithms during the
search. All the algorithms were implemented in the C programming language and
compiled on a MacBook Pro with 8 GB of RAM, using the gcc compiler (Apple LLVM
version 5.1, based on LLVM 3.4svn) with the -O3 optimization option.
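
The measurement protocol described above can be sketched as follows. This is a minimal illustration in Python rather than the C code used for the experiments; the names `order_isomorphic`, `naive_oppm` and `average_search_time_ms` are hypothetical, and the naive search merely stands in for whichever algorithm is being timed.

```python
import random
import time

def order_isomorphic(u, v):
    """Return True if u and v have the same length and the same relative order."""
    if len(u) != len(v):
        return False
    # Visit positions of u in increasing order of value and check that v
    # increases (or stays equal) at exactly the same steps.
    idx = sorted(range(len(u)), key=u.__getitem__)
    for a, b in zip(idx, idx[1:]):
        if u[a] == u[b]:
            if v[a] != v[b]:
                return False
        elif not v[a] < v[b]:
            return False
    return True

def naive_oppm(pattern, text):
    """Report every position where the text window is order-isomorphic to the pattern."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if order_isomorphic(pattern, text[i:i + m])]

def average_search_time_ms(search, text, m, runs=100, seed=0):
    """Average running time (ms) of `search` over `runs` patterns extracted from the text."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        start = rng.randrange(len(text) - m + 1)
        pattern = text[start:start + m]   # pattern is a substring of the text
        t0 = time.perf_counter()
        search(pattern, text)
        total += time.perf_counter() - t0
    return 1000.0 * total / runs
```

Because every pattern is extracted from the text, each search is guaranteed to report at least one occurrence.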

In the following tables running times are expressed in milliseconds. Best results
have been underlined.


Experimental Results on Random Sequences 


Experiments on Rand-δ numeric sequences were conducted with values of
δ = 5, 20, 40 (see Table 1, Table 2 and Table 3). The results show


Running times (Fct in milliseconds; remaining columns: speed-up relative to Fct)

 m       Fct    Nr2    Nr3    Nr4    Nr5    Nr6    No2    No3    No4
 8     42.62   1.16   1.28   1.28   1.25   1.25   1.94   1.70   1.09
12     28.35   1.19   1.41   1.39   1.36   1.21   1.75   2.06   1.65
16     20.37   1.18   1.32   1.44   1.37   1.17   1.49   2.09   1.83
20     16.12   1.15   1.29   1.46   1.39   1.12   1.39   2.04   1.95
24     13.35   1.18   1.30   1.46   1.44   1.13   1.36   1.97   1.99
28     11.60   1.18   1.32   1.47   1.50   1.14   1.37   1.96   2.06
32     10.06   1.16   1.29   1.45   1.48   1.10   1.33   1.89   2.07

False positives (Fct: average number detected; remaining columns: gain in %)

 m       Fct    Nr2    Nr3    Nr4    Nr5    Nr6    No2    No3    No4
 8  15413.57   86.6   93.7   95.9   94.4   91.9   98.1   99.4   99.8
12   1492.39   97.0   99.6   99.9   99.9   98.1   99.9  100.0  100.0
16    114.82   99.3  100.0  100.0  100.0   99.7  100.0  100.0  100.0
20      9.83  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0
24      0.83  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0
28      0.00      -      -      -      -      -      -      -      -
32      0.00      -      -      -      -      -      -      -      -

Table 3. Experimental results on a Rand-40 short integer sequence.


Running times (Fct in milliseconds; remaining columns: speed-up relative to Fct)

 m       Fct    Nr2    Nr3    Nr4    Nr5    Nr6    No2    No3    No4
 8     41.08   0.99   1.05   0.88   0.79   0.90   0.88   0.73   0.60
12     36.42   1.06   1.02   0.94   0.86   0.91   0.81   0.67   0.69
16     34.03   1.04   0.86   0.78   0.74   1.00   0.77   0.64   0.60
20     35.31   0.98   0.89   0.88   0.84   0.92   0.73   0.60   0.55
24     37.90   1.34   1.33   1.30   1.18   1.15   0.99   0.82   0.76
28     36.26   1.17   1.09   1.10   1.04   0.97   0.78   0.64   0.56
32     35.38   1.10   1.15   1.05   0.95   0.94   0.82   0.65   0.59

False positives (Fct: average number detected; remaining columns: gain in %)

 m       Fct    Nr2    Nr3    Nr4    Nr5    Nr6    No2    No3    No4
 8  48697.90   78.1   78.2   75.0   61.8   83.0   89.1   94.2   95.8
12  45427.73   66.4   72.8   74.6   71.4   67.7   76.3   82.4   84.9
16  32091.18   54.1   63.6   66.0   63.9   55.3   66.3   72.7   74.5
20  26337.31   41.0   49.0   52.6   53.6   43.0   53.4   59.1   61.5
24  23100.22   42.3   56.9   61.6   62.5   44.0   60.5   66.7   69.6
28  23296.19   53.2   63.0   70.7   73.1   55.1   65.8   73.8   76.7
32  17959.33   49.7   66.6   72.0   75.6   50.5   68.9   75.2   79.4

Table 4. Experimental results on a Period-5 short integer sequence.


that the No approach is the best choice in all cases, achieving a speed-up of
2.0 compared with the Fct algorithm. The Nr approach also consistently achieves
a good speed-up, between 1.15 and 1.50. The gain in the number of detected
false positives is impressive, in most cases between 90% and 100%. It is also
interesting to observe that the value of δ does not affect the running times or
the number of false positives detected during the search, which are very similar
in the three tables.
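
As a reading aid, the two derived quantities discussed above can be computed as in the following sketch. It assumes that the speed-up is the ratio between the running time of Fct and that of the tested algorithm, and that the gain is the percentage reduction in false positives relative to Fct; these definitions are inferred from the discussion rather than stated explicitly, and the function names are hypothetical.

```python
def speed_up(t_ref_ms, t_alg_ms):
    """Ratio between the reference running time and the tested algorithm's time."""
    return t_ref_ms / t_alg_ms

def false_positive_gain(fp_ref, fp_alg):
    """Percentage reduction in false positives relative to the reference.

    Returns None when the reference detects no false positives, since the
    gain is then undefined (shown as "-" in the tables).
    """
    if fp_ref == 0:
        return None
    return 100.0 * (fp_ref - fp_alg) / fp_ref
```

Under these assumptions, an algorithm that takes half the time of Fct has a speed-up of 2.0, and one that detects 2 false positives where Fct detects 200 has a gain of 99%.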


Experimental Results on Periodic Sequences 

Experiments on the Period-δ problem were conducted on a periodic sequence
with a period equal to 10 and with δ = 5 (see Table 4). The results show that
the Nr approach is the best choice in most cases, achieving a speed-up of 1.3
in suitable conditions. However, in some cases the Fct algorithm


Running times (Fct in milliseconds; remaining columns: speed-up relative to Fct)

 m       Fct    Nr2    Nr3    Nr4    Nr5    Nr6    No2    No3    No4
 8     42.35   0.98   1.18   0.91   0.81   0.89   1.02   0.83   0.68
12     39.09   1.11   1.14   1.06   0.98   1.00   1.02   0.88   0.93
16     34.25   1.11   1.01   1.02   1.01   1.08   0.96   0.87   0.87
20     35.41   1.10   1.09   1.21   1.21   1.07   0.97   0.89   0.89
24     35.15   1.31   1.51   1.67   1.60   1.14   1.15   1.10   1.18
28     32.23   1.23   1.40   1.56   1.36   1.07   1.04   1.08   1.15
32     30.34   1.43   1.60   1.53   1.43   1.22   1.19   1.11   1.07

False positives (Fct: average number detected; remaining columns: gain in %)

 m       Fct    Nr2    Nr3    Nr4    Nr5    Nr6    No2    No3    No4
 8  62122.44   56.9   77.8   71.5   57.1   60.9   84.7   91.8   95.9
12  50264.79   56.5   72.8   77.3   76.7   58.6   77.0   85.0   88.1
16  32026.85   60.0   73.8   79.4   80.5   62.4   78.8   86.3   89.2
20  23138.04   61.1   77.4   83.2   86.3   63.3   81.2   87.8   91.3
24  16535.75   65.1   82.8   88.6   91.0   68.0   85.3   91.3   94.2
28  12181.13   72.7   85.4   92.7   94.9   74.8   88.8   94.8   96.8
32   9276.84   75.2   90.4   94.2   97.0   76.1   91.4   95.4   98.0

Table 5. Experimental results on a Period-20 short integer sequence.


Running times (Fct in milliseconds; remaining columns: speed-up relative to Fct)

 m       Fct    Nr2    Nr3    Nr4    Nr5    Nr6    No2    No3    No4
 8     45.07   0.93   1.18   0.94   0.81   0.89   1.12   0.91   0.78
12     37.91   1.08   1.12   1.03   0.93   1.03   1.13   1.03   1.08
16     32.41   1.11   1.04   1.06   1.13   1.07   1.07   1.02   1.10
20     28.63   1.05   1.09   1.24   1.35   1.08   1.04   1.04   1.15
24     27.25   1.18   1.39   1.59   1.53   1.10   1.12   1.14   1.40
28     24.91   1.20   1.51   1.67   1.41   1.05   1.17   1.30   1.50
32     23.63   1.39   1.63   1.55   1.31   1.20   1.27   1.41   1.41

False positives (Fct: average number detected; remaining columns: gain in %)

 m       Fct    Nr2    Nr3    Nr4    Nr5    Nr6    No2    No3    No4
 8  61386.36   50.0   73.3   67.7   50.7   56.3   81.3   89.0   94.9
12  36298.84   59.3   76.3   80.6   82.1   62.4   81.8   89.3   93.2
16  19385.18   70.4   84.0   88.8   90.8   72.8   88.7   94.2   96.5
20  10325.29   74.6   88.3   93.7   96.1   78.8   92.9   97.0   98.5
24   6566.03   82.4   94.9   97.5   98.7   84.9   96.1   98.4   99.4
28   3141.06   82.8   94.4   98.0   99.1   85.2   96.2   98.8   99.5
32   2399.46   88.3   97.1   99.1   99.7   89.6   97.8   99.3   99.8

Table 6. Experimental results on a Period-40 short integer sequence.


turns out to be the best choice, especially on short patterns. The No approach
is always less efficient than the Fct algorithm, although its gain in the number
of detected false positives is always between 65% and 95%. This behavior is due
to the high number of candidate occurrences still detected by the algorithm,
despite its gain in the number of false positives, and to the relative effort
required to construct the filter values.
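
The trade-off described above, between the cost of building filter values and the number of candidates left to verify, can be illustrated with a generic filter-and-verify sketch. The fingerprint used here, which encodes all pairwise comparisons inside a window of q elements, is an assumption chosen for illustration in the spirit of the filtering approaches discussed; it is not claimed to be the exact definition of the Nr or No filters, and all function names are hypothetical.

```python
def order_isomorphic(u, v):
    """Return True if u and v have the same length and the same relative order."""
    if len(u) != len(v):
        return False
    idx = sorted(range(len(u)), key=u.__getitem__)
    for a, b in zip(idx, idx[1:]):
        if u[a] == u[b]:
            if v[a] != v[b]:
                return False
        elif not v[a] < v[b]:
            return False
    return True

def fingerprint(seq, i, q):
    """Encode the pairwise '<' comparisons of seq[i:i+q] as an integer."""
    code = 0
    for a in range(q):
        for b in range(a + 1, q):
            code = (code << 1) | (seq[i + a] < seq[i + b])
    return code

def filter_and_verify(pattern, text, q=3):
    """Return (verified matches, number of false positives); requires len(pattern) >= q."""
    m, n = len(pattern), len(text)
    fx = [fingerprint(pattern, i, q) for i in range(m - q + 1)]
    fy = [fingerprint(text, i, q) for i in range(n - q + 1)]
    k = len(fx)
    # Filtering phase: exact matching on the fingerprint sequences.
    candidates = [i for i in range(n - m + 1) if fy[i:i + k] == fx]
    # Verification phase: discard candidates that are not order-isomorphic.
    matches = [i for i in candidates if order_isomorphic(pattern, text[i:i + m])]
    return matches, len(candidates) - len(matches)
```

With the pattern `[1, 3, 2, 4]` and the text `[1, 3, 2, 4, 5, 9, 1, 2]`, choosing q = 2 leaves one false positive at position 4, while q = 3 removes it at the price of computing three comparisons per window instead of one.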

When the value of δ increases (see Table 5 and Table 6) the performance of the
No approach improves, achieving a speed-up of 1.4 in the best cases. However,
the Nr approach always turns out to be the best solution, with a speed-up close
to 1.7 for long patterns.

The gain in the number of false positives is always in the range between 50%
and 99.7% for the Nr algorithm, and between 80% and 99.8% for the No algorithm.
The gain of the No4 algorithm is in most cases close to 100%.


6 Conclusions 


In this paper we discussed the order-preserving pattern matching problem and
presented two new families of filtering approaches to solve it, which turn out
to be much more effective in practice than the previously presented methods.
The presented methods translate the original sequence into new sequences over
large alphabets in order to speed up the searching process and reduce the
number of false positives. Our experimental results show that the proposed
solutions are up to 2 times faster than the previous solutions and, under
suitable conditions, reduce the number of false positives by up to 99%.